{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to optimize the code?\n",
    "\n",
    "In this notebook I want to show how to write efficient code and how cython and C code can help to improve the speed.  \n",
    "I decided to consider as example the SOR algorithm. Here we can see how the algorithm presented in the Notebook **A.1 Solution of linear equations** can be modified for our specific needs (i.e. solving PDEs). \n",
    "\n",
    "Again, if you are curious about the SOR and want to know more, have a look at the wiki page [link](https://en.wikipedia.org/wiki/Successive_over-relaxation).\n",
    "\n",
    "## Contents\n",
    "   - [Python implelentation](#sec1)\n",
    "   - [Cython](#sec2)\n",
    "   - [C code](#sec3)\n",
    "      - [BS python vs C](#sec3.1) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "import numpy as np\n",
    "import scipy as scp\n",
    "from scipy.linalg import norm\n",
    "from functions.Solvers import SOR, SOR2\n",
    "%load_ext cython\n",
    "import Cython"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 3000\n",
    "aa = 2; bb = 10; cc = 5\n",
    "A = np.diag(aa * np.ones(N-1), -1) + np.diag(bb * np.ones(N), 0) + np.diag(cc * np.ones(N-1), 1)\n",
    "x = 2 * np.ones(N)\n",
    "b = A@x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we use a tridiagonal matrix A \n",
    "\n",
    "$$ \\left(\n",
    "\\begin{array}{ccccc}\n",
    "bb     & cc     & 0      & \\cdots & 0 \\\\\n",
    "aa      & bb     & cc     & 0      & 0  \\\\\n",
    "0      & \\ddots & \\ddots & \\ddots & 0  \\\\\n",
    "\\vdots & 0      & aa     & bb     & cc  \\\\\n",
    "0      & 0      & 0      & aa     & bb \\\\\n",
    "\\end{array}\n",
    "\\right) $$\n",
    "\n",
    "with equal elements in the three diagonals:   \n",
    "\n",
    "$$ aa = 2, \\quad bb = 10, \\quad cc = 5 $$\n",
    "\n",
    "This is the case of the Black-Scholes equation (in log-variables).   \n",
    "The matrix A is quite big because we want to test the performances of the algorithms.\n",
    "\n",
    "The linear system is always the same: \n",
    "\n",
    "$$ A x = b$$\n",
    "\n",
    "For simplicity I chose $x = [2,...,2]$. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='sec1'></a>\n",
    "## Python implementation\n",
    "\n",
    "I wrote two functions to implement the SOR algorithm with the aim of solving PDEs. \n",
    " - ```SOR``` uses matrix multiplications. The code is the same presented in the notebook **A1**: First it creates the matrices D,U,L (if A is sparse, it is converted into a numpy.array). Then it iterates the solutions until convergence.  \n",
    " - ```SOR2``` iterates over all components of $x$ . It does not perform matrix multiplications but it considers each component of $x$ for the computations.     \n",
    "The algorithm is the following:   \n",
    "\n",
    "```python\n",
    "    x0 = np.ones_like(b, dtype=np.float64) # initial guess\n",
    "    x_new = np.ones_like(x0)               # new solution\n",
    "    \n",
    "    for k in range(1,N_max+1):           # iteration until convergence\n",
    "        for i in range(N):               # iteration over all the rows\n",
    "            S = 0\n",
    "            for j in range(N):           # iteration over the columns\n",
    "                if j != i:\n",
    "                    S += A[i,j]*x_new[j]\n",
    "            x_new[i] = (1-w)*x_new[i] + (w/A[i,i]) * (b[i] - S)  \n",
    "                   \n",
    "        if norm(x_new - x0) < eps:       # check convergence\n",
    "            return x_new\n",
    "        x0 = x_new.copy()                # updates the solution \n",
    "        if k==N_max:\n",
    "            print(\"Fail to converge in {} iterations\".format(k))\n",
    "```\n",
    "This algorithm is taken from the SOR wiki [page](https://en.wikipedia.org/wiki/Successive_over-relaxation) and it is equivalent to the algorithm presented in the notebook **A1**.\n",
    "\n",
    "Let us see how fast they are: (well... how **slow** they are... be ready to wait about 6 minutes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 10.2 s, sys: 15.7 s, total: 25.9 s\n",
      "Wall time: 8.33 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([2., 2., 2., ..., 2., 2., 2.])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "SOR(A,b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 6min 23s, sys: 560 ms, total: 6min 24s\n",
      "Wall time: 6min 24s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([2., 2., 2., ..., 2., 2., 2.])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "SOR2(A,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TOO BAD!\n",
    "\n",
    "The second algorithm is very bad. There is an immediate improvement to do:  \n",
    "We are working with a **tridiagonal matrix**. It means that all the elements not on the three diagonals are zero. The first piece of code to modify is OBVIOUSLY this:\n",
    "```python\n",
    "for j in range(N):           # iteration over the columns\n",
    "    if j != i:\n",
    "        S += A[i,j]*x_new[j]\n",
    "``` \n",
    "There is no need to sum zero elements.  \n",
    "Let us consider the new function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def SOR3(A, b, w=1, eps=1e-10, N_max = 100):\n",
    "    N = len(b)\n",
    "    x0 = np.ones_like(b, dtype=np.float64) # initial guess\n",
    "    x_new = np.ones_like(x0)               # new solution\n",
    "    for k in range(1,N_max+1):\n",
    "        for i in range(N):\n",
    "            if (i==0):                     # new code start  \n",
    "                S = A[0,1] * x_new[1]\n",
    "            elif (i==N-1):\n",
    "                S = A[N-1,N-2] * x_new[N-2]\n",
    "            else:\n",
    "                S = A[i,i-1] * x_new[i-1] + A[i,i+1] * x_new[i+1]\n",
    "                                           # new code end \n",
    "            x_new[i] = (1-w)*x_new[i] + (w/A[i,i]) * (b[i] - S)  \n",
    "        if norm(x_new - x0) < eps:\n",
    "            return x_new\n",
    "        x0 = x_new.copy()\n",
    "        if k==N_max:\n",
    "            print(\"Fail to converge in {} iterations\".format(k))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "571 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "SOR3(A,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OK ... it was easy!\n",
    "\n",
    "But wait a second... if all the elements in the three diagonals are equal, do we really need a matrix?   \n",
    "Of course, we can use sparse matrices to save space in memory. But do we really need any kind of matrix?  \n",
    "The same algorithm can be written considering just the three values $aa,bb,cc$.   \n",
    "\n",
    "**In the following algorithm, even if the gain in speed is not so much, we save a lot of space in memory!!** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def SOR4(aa, bb, cc, b, w=1, eps=1e-10, N_max = 100):\n",
    "    N = len(b)\n",
    "    x0 = np.ones_like(b, dtype=np.float64) # initial guess\n",
    "    x_new = np.ones_like(x0)               # new solution\n",
    "    for k in range(1,N_max+1):\n",
    "        for i in range(N):\n",
    "            if (i==0):\n",
    "                S = cc * x_new[1]\n",
    "            elif (i==N-1):\n",
    "                S = aa * x_new[N-2]\n",
    "            else:\n",
    "                S = aa * x_new[i-1] + cc * x_new[i+1]\n",
    "            x_new[i] = (1-w)*x_new[i] + (w/bb) * (b[i] - S)  \n",
    "        if norm(x_new - x0) < eps:\n",
    "            return x_new\n",
    "        x0 = x_new.copy()\n",
    "        if k==N_max:\n",
    "            print(\"Fail to converge in {} iterations\".format(k))\n",
    "            return x_new"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "430 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "SOR4(aa,bb,cc,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='sec2'></a>\n",
    "## Cython\n",
    "\n",
    "\n",
    "For those who are not familiar with Cython, I suggest to read this introduction [link](https://cython.readthedocs.io/en/latest/src/userguide/numpy_tutorial.html).\n",
    "\n",
    "Cython, basically, consists in adding types to the python variables. \n",
    "\n",
    "Let's see what happens to the speed when we add types to the previous pure python function (SOR4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%cython\n",
    "import numpy as np\n",
    "from scipy.linalg import norm\n",
    "cimport numpy as np  \n",
    "cimport cython\n",
    "\n",
    "@cython.boundscheck(False)    # turn off bounds-checking for entire function\n",
    "@cython.wraparound(False)     # turn off negative index wrapping for entire function\n",
    "def SOR_cy(np.float64_t aa, \n",
    "              np.float64_t bb, np.float64_t cc, \n",
    "              np.ndarray[np.float64_t , ndim=1] b, \n",
    "              double w=1, double eps=1e-10, int N_max = 100):\n",
    "    \n",
    "    cdef unsigned int N = b.size\n",
    "    cdef np.ndarray[np.float64_t , ndim=1] x0 = np.ones(N, dtype=np.float64)     # initial guess\n",
    "    cdef np.ndarray[np.float64_t , ndim=1] x_new = np.ones(N, dtype=np.float64)  # new solution\n",
    "    cdef unsigned int i, k\n",
    "    cdef np.float64_t S\n",
    "    \n",
    "    for k in range(1,N_max+1):\n",
    "        for i in range(N):\n",
    "            if (i==0):\n",
    "                S = cc * x_new[1]\n",
    "            elif (i==N-1):\n",
    "                S = aa * x_new[N-2]\n",
    "            else:\n",
    "                S = aa * x_new[i-1] + cc * x_new[i+1]\n",
    "            x_new[i] = (1-w)*x_new[i] + (w/bb) * (b[i] - S)  \n",
    "        if norm(x_new - x0) < eps:\n",
    "            return x_new\n",
    "        x0 = x_new.copy()\n",
    "        if k==N_max:\n",
    "            print(\"Fail to converge in {} iterations\".format(k))\n",
    "            return x_new"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.98 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "SOR_cy(aa,bb,cc,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### About 100 times faster!!!\n",
    "\n",
    "That's good.\n",
    "\n",
    "So... those who are not familiar with Cython maybe are confused about the new type `np.float64_t`. We wrote: \n",
    "```python\n",
    "import numpy as np\n",
    "cimport numpy as np  \n",
    "```  \n",
    "The first line imports numpy module in the python space. \n",
    "It only gives access to Numpy’s pure-Python API and it occurs at runtime.\n",
    "\n",
    "The second line gives access to the Numpy’s C API defined in the `__init__.pxd` file during compile time.\n",
    "\n",
    "Even if they are both named `np`, they are automatically recognized.\n",
    "In `__init__.pdx` it is defined:\n",
    "```\n",
    "ctypedef double       npy_float64\n",
    "ctypedef npy_float64    float64_t\n",
    "``` \n",
    "The `np.float64_t` represents the type `double` in C."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Memoryviews\n",
    "\n",
    "Let us re-write the previous code using the faster [memoryviews](https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html).   \n",
    "I suggest to the reader to have a fast look at the memoryviews manual in the link. There are no difficult concepts and the notation is not so different from the notation used in the previous function. \n",
    "\n",
    "Memoryviews is another tool to help speed up the algorithm.\n",
    "\n",
    "I have to admit that when I was writing the new code I realized that using the function `norm` is not the optimal way. (I got an error because `norm` only accepts ndarrays... so, thanks memoryviews :)  ).  \n",
    "Well, the `norm` function computes a square root, which still requires some computations.  \n",
    "We can define our own function `distance2` (which is the square of the distance) that is compared with the square of the tolerance parameter `eps * eps`. This is another improvement of the algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%cython\n",
    "import numpy as np\n",
    "cimport numpy as np\n",
    "cimport cython\n",
    "\n",
    "cdef double distance2(double[:] a, double[:] b, unsigned int N):\n",
    "    cdef double dist = 0\n",
    "    cdef unsigned int i \n",
    "    for i in range(N):\n",
    "        dist += (a[i] - b[i]) * (a[i] - b[i])\n",
    "    return dist\n",
    "\n",
    "@cython.boundscheck(False)\n",
    "@cython.wraparound(False)\n",
    "def SOR_cy2(double aa, \n",
    "              double bb, double cc, \n",
    "              double[:] b, \n",
    "              double w=1, double eps=1e-10, int N_max = 200):\n",
    "    \n",
    "    cdef unsigned int N = b.size    \n",
    "    cdef double[:] x0 = np.ones(N, dtype=np.float64)          # initial guess\n",
    "    cdef double[:] x_new = np.ones(N, dtype=np.float64)       # new solution\n",
    "    cdef unsigned int i, k\n",
    "    cdef double S\n",
    "    \n",
    "    for k in range(1,N_max+1):\n",
    "        for i in range(N):\n",
    "            if (i==0):\n",
    "                S = cc * x_new[1]\n",
    "            elif (i==N-1):\n",
    "                S = aa * x_new[N-2]\n",
    "            else:\n",
    "                S = aa * x_new[i-1] + cc * x_new[i+1]\n",
    "            x_new[i] = (1-w)*x_new[i] + (w/bb) * (b[i] - S)  \n",
    "        if distance2(x_new, x0, N) < eps*eps:\n",
    "            return np.asarray(x_new)\n",
    "        x0[:] = x_new\n",
    "        if k==N_max:\n",
    "            print(\"Fail to converge in {} iterations\".format(k))\n",
    "            return np.asarray(x_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.17 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "SOR_cy2(aa,bb,cc,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Good job!! Another improvement!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='sec3'></a>\n",
    "## C code\n",
    "\n",
    "The last improvement is to write the function in C code and call it from python.  \n",
    "Inside the folder `functions/C` you can find the header file `SOR.h` and the implementation file `SOR.c` (you will find also the `mainSOR.c` if you want to test the SOR algorithm directly in C).    \n",
    "I will call the function `SOR_abc` declared in the header `SOR.h`.  \n",
    "First it is declared as extern, and then it is called inside `SOR_c` with a cast to `<double[:arr_memview.shape[0]]>`.\n",
    "\n",
    "If you are using docker the next function will compile correctly. If you are not using docker you have to replace the INCLUDE_PATH with the output of the next cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INCLUDE_PATH =  /home/jovyan/work/functions/C\n"
     ]
    }
   ],
   "source": [
    "print(\"INCLUDE_PATH = \", os.getcwd() + \"/functions/C\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%cython -I /home/jovyan/work/functions/C\n",
    "# With docker no need to modify anything.\n",
    "#\n",
    "# If you are not using docker, just replace the line above, \n",
    "# with the line below and replace INCLUDE_PATH:\n",
    "# \n",
    "# %%cython -I INCLUDE_PATH\n",
    "#\n",
    "# The %%cython directive must be the first keyword in the cell\n",
    "\n",
    "cdef extern from \"SOR.c\":\n",
    "    pass\n",
    "cdef extern from \"SOR.h\":\n",
    "    double* SOR_abc(double, double, double, double *, int, double, double, int)\n",
    "\n",
    "import numpy as np\n",
    "cimport cython\n",
    "\n",
    "@cython.boundscheck(False)\n",
    "@cython.wraparound(False)\n",
    "def SOR_c(double aa, double bb, double cc, B, double w=1, double eps=1e-10, int N_max = 200): \n",
    "\n",
    "    if not B.flags['C_CONTIGUOUS']:\n",
    "        B = np.ascontiguousarray(B) # Makes a contiguous copy of the numpy array\n",
    "        \n",
    "    cdef double[::1] arr_memview = B    \n",
    "    cdef double[::1] x = <double[:arr_memview.shape[0]]>SOR_abc(aa, bb, cc, \n",
    "                                            &arr_memview[0], arr_memview.shape[0], \n",
    "                                            w, eps, N_max)\n",
    "    return np.asarray(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.21 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "SOR_c(aa,bb,cc,b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Well... it looks like that using Cython with memoryviews has the same performances as wrapping a C function.\n",
    "\n",
    "For this reason, I used the cython version as solver in the class `BS_pricer`.  \n",
    "We already compared some performances in the notebook **1.2 - BS PDE**, and we saw that the SOR algorithm is slow compared to the LU or Thomas algorithms.  \n",
    "Just for curiosity, let us compare the speed of the python PDE_price method implemented with cython SOR algorithm, and a pricer with same SOR algorithm fully implemented in C."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from functions.Parameters import Option_param\n",
    "from functions.Processes import Diffusion_process\n",
    "from functions.BS_pricer import BS_pricer\n",
    "\n",
    "opt_param = Option_param(S0=100, K=100, T=1, exercise=\"European\", payoff=\"call\" )\n",
    "diff_param = Diffusion_process(r=0.1, sig=0.2)\n",
    "BS = BS_pricer(opt_param, diff_param)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id='sec3.1'></a>\n",
    "## BS python vs C"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the command `make` to compile the C [code](./functions/C/PDE_solver.c):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "os.system(\"cd ./functions/C/ && make\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python program with Cython SOR method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Price: 13.269170 Time: 7.882848\n"
     ]
    }
   ],
   "source": [
    "print(\"Price: {0:.6f} Time: {1:.6f}\".format(*BS.PDE_price((3000,2000), Time=True, solver=\"SOR\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pure C program:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The price is: 13.269139 \n",
      " \n",
      "CPU times: user 0 ns, sys: 12 ms, total: 12 ms\n",
      "Wall time: 16.5 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "result = subprocess.run(\"./functions/C/BS_sor\", stdout=subprocess.PIPE, stderr=subprocess.STDOUT)\n",
    "print(result.stdout.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The C code is slower. \n",
    "### Exercise:\n",
    "Can you guess why the C code is slower?  If you know, send me an email "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
